This repository has been archived by the owner on May 24, 2022. It is now read-only.

New test runner implementation #208

Merged: rojer merged 5 commits into master from new_runner on Jan 18, 2021

Conversation

@rojer (Contributor) commented Dec 17, 2020

Simplified, added support for serializing state and resuming (not used yet but will be).

@facebook-github-bot added the CLA Signed label on Dec 17, 2020 (label managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed)
@codecov (bot) commented Dec 17, 2020

Codecov Report

Merging #208 (86b113d) into master (40dcae5) will decrease coverage by 0.44%.
The diff coverage is 85.24%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #208      +/-   ##
==========================================
- Coverage   62.03%   61.58%   -0.45%     
==========================================
  Files          85       83       -2     
  Lines        4104     4056      -48     
==========================================
- Hits         2546     2498      -48     
- Misses       1235     1237       +2     
+ Partials      323      321       -2     
Flag Coverage Δ
integration 56.58% <68.39%> (-2.57%) ⬇️
integration_storage 100.00% <ø> (ø)
unittests 50.73% <86.84%> (+19.33%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
pkg/cerrors/cerrors.go 25.00% <0.00%> (-41.67%) ⬇️
pkg/event/testevent/test.go 80.76% <0.00%> (-19.24%) ⬇️
pkg/pluginregistry/bundles.go 51.72% <ø> (ø)
pkg/storage/storage.go 33.33% <ø> (ø)
pkg/target/target.go 38.09% <ø> (ø)
plugins/storage/memory/memory.go 91.48% <ø> (ø)
tests/integ/jobmanager/common.go 90.40% <ø> (-0.04%) ⬇️
tests/plugins/teststeps/channels/channels.go 0.00% <0.00%> (-50.00%) ⬇️
tests/plugins/teststeps/hanging/hanging.go 0.00% <ø> (-50.00%) ⬇️
tests/plugins/teststeps/noreturn/noreturn.go 66.66% <ø> (ø)
... and 12 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 40dcae5...86b113d.

@rojer (Contributor, Author) commented Dec 17, 2020

will decrease coverage by 1.66%.

new runner is 91% covered by tests but since more code is deleted than added, overall coverage goes down.

@rojer (Contributor, Author) commented Dec 18, 2020

clarified the description a bit, based on feedback from @rihter007

@marcoguerri (Contributor):
I haven't started a full review yet; it will take a while. What this is missing is emission of TargetIn (https://github.com/facebookincubator/contest/blob/master/pkg/runner/test_runner_route.go#L88) and TargetOut (https://github.com/facebookincubator/contest/blob/master/pkg/runner/test_runner_route.go#L204) events, which are then used to rebuild the status of the job (https://github.com/facebookincubator/contest/blob/master/pkg/runner/job_status.go#L23-L28) from any instance, even those that do not own the job.

This duplicates how we keep track of the state; we should converge on using the same approach for both resume and Status queries.

@rojer (Contributor, Author) commented Dec 21, 2020

@marcoguerri thanks! indeed, i did not pay sufficient attention to the events that should be emitted. this should be fixed now - i added TargetIn, TargetOut, TargetErr and made sure a TestError is emitted in the right cases.
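
A rough, self-contained sketch of the emission ordering being described: a target gets TargetIn when it enters a step, then exactly one of TargetOut or TargetErr when it leaves. All types and helpers here are invented for illustration; the real contest framework has its own target.Target type and test event emitters with different signatures.

package main

import (
	"errors"
	"fmt"
)

// Illustrative stand-in only; not the contest framework's type.
type Target struct{ ID string }

func emit(event string, tgt *Target) {
	// In contest, events like these are persisted so that job status
	// can be rebuilt from storage by any instance.
	fmt.Printf("event %s for target %s\n", event, tgt.ID)
}

// runTargetThroughStep sketches the ordering: TargetIn on entry,
// then exactly one of TargetOut (success) or TargetErr (failure).
func runTargetThroughStep(tgt *Target, step func(*Target) error) {
	emit("TargetIn", tgt)
	if err := step(tgt); err != nil {
		emit("TargetErr", tgt)
		return
	}
	emit("TargetOut", tgt)
}

func main() {
	runTargetThroughStep(&Target{ID: "host1"}, func(*Target) error { return nil })
	runTargetThroughStep(&Target{ID: "host2"}, func(*Target) error { return errors.New("step failed") })
}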

@rojer force-pushed the new_runner branch 3 times, most recently from 4fa6f44 to ec4325d on December 21, 2020
@tfg13 (Contributor) left a comment

Mostly nits. Since it is a super large change, it would be good if someone else read it as well.
(I did not read the unit tests.)

pkg/cerrors/cerrors.go (outdated, resolved)
pkg/cerrors/cerrors.go (outdated, resolved)
pkg/jobmanager/jobmanager.go (outdated, resolved)
pkg/runner/goroutine_leak_check.go (outdated, resolved)
pkg/runner/job_runner.go (resolved)
pkg/runner/test_runner.go (outdated, resolved)
pkg/runner/test_runner.go (outdated, resolved)
// wait until error channel is emptied too.
outCh = make(chan *target.Target)
tr.safeCloseErrCh(ss)
break
Contributor:

Suggested change: break → break loop

Contributor:

actually, on further reflection - no, we don't want to break out of the loop just yet. as the comment above says, we want to make sure the error channel also gets processed, which is exactly why we replace outCh with a new one and wait for ss.errCh to be closed as well.

Contributor:

Ah, I wasn't sure what you wanted to break out of; I guess you meant the select then?
Would continue loop be clearer in that case?

Contributor Author:

great idea, indeed that would make the intent clearer. done.
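
A minimal standalone illustration of the control flow under discussion (the channel names mirror the snippet above; everything else is invented): a bare break inside select only exits the select statement, so draining the remaining channel requires either disabling the finished case and continuing the loop, or an explicit label.

package main

import (
	"fmt"
	"time"
)

func main() {
	outCh := make(chan int)
	errCh := make(chan error)

	go func() {
		outCh <- 1
		outCh <- 2
		close(outCh)                      // producer is done with results...
		time.Sleep(10 * time.Millisecond) // demo only: let the close be observed first
		errCh <- fmt.Errorf("late error")
		close(errCh) // ...and done reporting errors
	}()

loop:
	for {
		select {
		case v, ok := <-outCh:
			if !ok {
				// A bare `break` here would only leave the select.
				// Disable this case (a nil channel never fires) and
				// keep looping so errCh still gets drained.
				outCh = nil
				continue loop
			}
			fmt.Println("result:", v)
		case err, ok := <-errCh:
			if !ok {
				break loop // both channels done: now leave the for loop
			}
			fmt.Println("error:", err)
		}
	}
}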

pkg/runner/test_runner.go (outdated, resolved)
pkg/cerrors/cerrors.go (resolved)
pkg/runner/goroutine_leak_check.go (outdated, resolved)
pkg/runner/job_runner.go (resolved)
pkg/runner/test_runner.go (outdated, resolved)
pkg/runner/test_runner.go (outdated, resolved)
pkg/runner/test_runner.go (resolved)
pkg/runner/test_runner.go (outdated, resolved)
pkg/runner/test_runner.go (outdated, resolved)
pkg/runner/test_runner.go (resolved)
pkg/runner/test_runner.go (resolved)
@xaionaro (Contributor) commented Jan 18, 2021

Just a warning: the quality of my review was poor this time. It mostly relates to coding style (the code as written is difficult for my brain to verify).

@xaionaro (Contributor):
P.S.: Also, the mutex logic looks dangerous in multiple ways. I guess it might cause a deadlock somewhere in the future (when somebody modifies the code).

@rojer (Contributor, Author) commented Jan 18, 2021

P.S.: Also, the mutex logic looks dangerous in multiple ways. I guess it might cause a deadlock somewhere in the future (when somebody modifies the code).

well, this is overly general. the previous implementation didn't use mutexes but was a maze of channels that was impossible to understand.
in this refactoring i tried to keep locking simple, in particular by having only one mutex: separate per-target and per-step locks might've been more efficient but would make the code considerably more complicated.

@rojer merged commit 11388b8 into master on Jan 18, 2021
@rojer deleted the new_runner branch on January 18, 2021
@marcoguerri (Contributor) left a comment

Apologies for the late review. In fact, I am catching up only now, and I am also far from being done (for my reference, I shall resume from waitStepRunners). Even though this has already been merged, I still want to finish leaving a first pass of comments today.

CurPhase targetStepPhase `json:"cur_phase"` // Current phase of step execution.

res error // Final result, if reached the end state.
resCh chan error // Channel used to communicate result by the step runner.
Contributor:

Is this the channel used to communicate between stepRunner and targetHandler? Could you please clarify that in the comment? I haven't read through the whole code yet and it's not clear.
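
A guess at the pattern being asked about, purely illustrative rather than the runner's actual code: resCh would carry the step's final verdict for one target from the step-runner side back to that target's handler, with nil meaning success.

package main

import (
	"errors"
	"fmt"
)

func main() {
	// Hypothetical sketch: one result channel per target, written by
	// the step-runner side, read by the target handler.
	resCh := make(chan error, 1)

	// step-runner side: report the verdict for this target.
	go func() { resCh <- errors.New("power cycle failed") }()

	// target-handler side: wait for the verdict.
	if err := <-resCh; err != nil {
		fmt.Println("target failed this step:", err)
	} else {
		fmt.Println("target passed this step")
	}
}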

pkg/runner/test_runner.go (resolved)
// Wait for step runners and readers to exit.
if err := tr.waitStepRunners(ctx); err != nil {
tr.log.Errorf("step runner error: %q, canceling", err)
stepCancel()
Contributor:

Does waitStepRunners guarantee that all steps have returned? If so, stepCancel should not be necessary. If not, steps should be awaited when a cancellation signal is being propagated. We should keep track of any step that does not return within the timeout and flag it accordingly.

Contributor Author:

you are right, this is not needed as by this point all the step runners have either exited or we gave up waiting on them to exit.

tr.log.Debugf(" %d %s %v", i, ts, stepErr)
if ts.CurPhase == targetStepPhaseRun {
inFlightTargets = append(inFlightTargets, ts)
if stepErr != statectx.ErrPaused {
Contributor:

If even a single target failed in any step, we declare the test not resumable (while still continuing the loop after setting resumeOk to false)? It should be fine for a target to fail in a step and still be able to resume the test.

Contributor Author:

stepState.runErr aka stepErr is not about target errors, it is

// Runner error, returned from Run() or an error condition detected by the reader.

i.e. it's an error condition that aborts the entire run. the presence of such errors making the run non-resumable makes sense.

Contributor:

Well, ok, not sure why I missed that we were referring here to an error of the whole step.

pkg/runner/test_runner.go (resolved)
@marcoguerri (Contributor):
but was a maze of channels that was impossible to understand.

There was a lot of room for simplification in the previous implementation, in particular:

This is just to say that, in my opinion, this solution has its own traps, as it makes the data flow implicit (via a mutex and a global structure) rather than explicit, via channels. No questioning that the previous solution could be simplified, of course, but I personally find this solution not significantly easier to follow than the previous version with some unnecessary complexity removed.

@xaionaro (Contributor):
This is just to say that, in my opinion, this solution has its own traps, as it makes the data flow implicit (via a mutex and a global structure) rather than explicit, via channels. No questioning that the previous solution could be simplified, of course, but I personally find this solution not significantly easier to follow than the previous version with some unnecessary complexity removed.

I have exactly the same impression.

@rojer (Contributor, Author) commented Jan 19, 2021

@marcoguerri sorry, i should have waited for your approval before merging. but i created https://github.com/facebookincubator/contest/pull/212 so we can continue the review there.

pkg/runner/test_runner.go (resolved)
tr.log.Debugf("waiting for step runners to finish")
swch := make(chan struct{})
go func() {
tr.mu.Lock()
Contributor:

This is an example of the indirect data path I was mentioning before.

Intuitively, I would imagine that tr.steps is protected by tr.mu. Indeed, this goroutine acquires mu throughout its whole execution, and I see that we read ss.<X> and then wait on tr.cond.Wait. Is somebody else, from outside, changing stepRunning, readerRunning, etc. while we hold mu?

From looking at the code, yes, but those writers also hold mu. So I cannot wrap my head around how this can work.

Contributor:

this has to do with how waiting on sync.Cond works. when you want to wait for some internal state protected by a mutex to change (in this case stepRunning and readerRunning), you acquire the mutex, test the state, and if it is not to your satisfaction, you invoke tr.cond.Wait(), which releases the mutex and suspends until signaled, at which point it re-locks and returns, letting you re-examine the state. we signal tr.cond when things change (runner exits, reader exits, or target handler exits).
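
A self-contained sketch of that sync.Cond pattern (field names loosely modeled on the runner's stepRunning/readerRunning; everything else is invented): the waiter re-checks the predicate in a loop under the mutex, and every writer signals after mutating the state.

package main

import (
	"fmt"
	"sync"
	"time"
)

type runner struct {
	mu            sync.Mutex
	cond          *sync.Cond
	stepRunning   bool
	readerRunning bool
}

func main() {
	r := &runner{stepRunning: true, readerRunning: true}
	r.cond = sync.NewCond(&r.mu)

	// Writers: update the state under mu, then wake the waiters so
	// they re-check it.
	go func() {
		time.Sleep(50 * time.Millisecond)
		r.mu.Lock()
		r.stepRunning = false
		r.mu.Unlock()
		r.cond.Broadcast()
	}()
	go func() {
		time.Sleep(100 * time.Millisecond)
		r.mu.Lock()
		r.readerRunning = false
		r.mu.Unlock()
		r.cond.Broadcast()
	}()

	// Waiter: hold mu and re-check the predicate in a loop. Wait()
	// atomically releases mu and suspends; on wake-up it re-acquires
	// mu before returning, so the check below is always race-free.
	r.mu.Lock()
	for r.stepRunning || r.readerRunning {
		r.cond.Wait()
	}
	r.mu.Unlock()
	fmt.Println("all runners exited")
}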

err = nrerr
}
for _, ss := range tr.steps {
if ss.stepRunning {
Contributor:

The previous implementation here would also issue a termination signal to the whole test in a further attempt to make the pipeline shut down.

Contributor:

the step runner context will eventually be canceled (via defer stepCancel())

pkg/runner/test_runner.go (resolved)
pkg/runner/test_runner.go (resolved)
pkg/runner/test_runner.go (resolved)
// custom timeouts
func NewTestRunnerWithTimeouts(timeouts TestRunnerTimeouts) TestRunner {
	return TestRunner{timeouts: timeouts}
}

func (tr *TestRunner) safeCloseOutCh(ss *stepState) {
Contributor:

Looks like safeCloseOutCh and safeCloseErrCh could be merged into one function that accepts the channel as an argument.

Contributor:

yes, except the channels have different types and i couldn't find a way to cast them to something common.
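
A sketch of one way the two helpers could be unified today, assuming Go 1.18+ type parameters (which did not exist when this PR was written); the closed-flag bookkeeping here is invented for illustration, not the runner's actual fields.

package main

import "sync"

// Hypothetical close-once helper: with type parameters, a single
// function can serve channels of any element type. Before generics,
// one helper per channel type was the straightforward option.
func safeClose[T any](ch chan T, closed *bool, mu *sync.Mutex) {
	mu.Lock()
	defer mu.Unlock()
	if !*closed {
		*closed = true
		close(ch)
	}
}

func main() {
	var (
		mu        sync.Mutex
		outClosed bool
		errClosed bool
	)
	outCh := make(chan int)
	errCh := make(chan error)

	safeClose(outCh, &outClosed, &mu) // closes outCh
	safeClose(outCh, &outClosed, &mu) // no-op: already closed
	safeClose(errCh, &errClosed, &mu) // same helper, different element type
}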

// At this point we may still have an error to report,
// wait until error channel is emptied too.
outCh = nil
tr.safeCloseErrCh(ss)
Contributor:

If outCh indicates that the test step is guaranteed to have returned, then whoever signaled completion by closing outCh should also have closed errCh. The responsibility of closing these channels should not be spread across multiple entities.

Contributor:

this will be simplified when we have one channel for results (as discussed). i'll leave it as is for now.

break loop
case <-ss.errCh:
break loop
case <-time.After(tr.shutdownTimeout):
Contributor:

This should not be necessary: it should be stepRunner that guarantees to us that step termination is bounded by shutdownTimeout. We shouldn't be doing this in multiple places.

pkg/runner/test_runner.go (resolved)
@marcoguerri (Contributor):
Tried to finish the review of at least test_runner.go. It looks like I cannot comment in #212 outside the scope of that specific PR (which is really not ideal).
